102 research outputs found

    Identification of patient classes in low back pain data using crisp and fuzzy clustering methods

    We performed a cluster analysis of the low back pain dataset in the framework of the IFCS-2017 data challenge. Because the original data contained missing values, the first part of our analysis concerned the imputation of missing values using the Fully Conditional Specification model. The Local Outlier Factor method was then used to detect and eliminate the outliers. After data normalization, we removed highly correlated variables from the transformed dataset and carried out k-means clustering of the remaining variables based on their correlations, i.e., the variables with the highest mutual correlations were assigned to the same cluster. Once the variables were assigned to different clusters, one representative per cluster, i.e., the variable with the highest contribution score at the first principal component, was selected. Among the 13 selected variables, there are representatives of each of the 6 variable domains (contextual factor, participation, pain, psychological, activity and physical impairment), specified as important in the paper by Nielsen et al. (2016). Different clustering methods, including DAPC, k-means and k-medoids, were then carried out to cluster the reduced low back pain data. Consensus solutions, both crisp and fuzzy, were calculated using the GV3 method. The obtained crisp consensus clustering, including 5 classes, was described in detail and compared to the meta-data annotation.

    Building explicit hybridization networks using the maximum likelihood and Neighbor-Joining approaches

    Tree topologies are the simplest structures which can be used to represent the evolution of species. Over the last two decades, more complex structures, called phylogenetic networks, have been introduced to take into account the mechanisms of reticulate evolution, such as species hybridization and horizontal gene transfer among bacteria and viruses. Several algorithms and software packages have been developed in this context, but most of them yield as output only an implicit network, which can be difficult to interpret. In this paper, we introduce a new algorithm for inferring explicit hybridization networks from binary data. In order to build our explicit hybridization networks, we use a maximum likelihood approach applied to Neighbor-Joining tree configurations.

    Systematic error detection in experimental high-throughput screening

    Background: High-throughput screening (HTS) is a key part of the drug discovery process during which thousands of chemical compounds are screened and their activity levels measured in order to identify potential drug candidates (i.e., hits). Many technical, procedural or environmental factors can cause systematic measurement error or inequalities in the conditions in which the measurements are taken. Such systematic error has the potential to critically affect the hit selection process. Several error correction methods and software packages have been developed to address this issue in the context of experimental HTS [1-7]. Despite their power to reduce the impact of systematic error when applied to error-perturbed datasets, those methods also have one disadvantage: they introduce a bias when applied to data not containing any systematic error [6]. Hence, we first need to assess the presence of systematic error in a given HTS assay and then apply a systematic error correction method if and only if the presence of systematic error has been confirmed by statistical tests. Results: We tested three statistical procedures to assess the presence of systematic error in experimental HTS data, including the χ² goodness-of-fit test, Student's t-test and Kolmogorov-Smirnov test [8] preceded by the Discrete Fourier Transform (DFT) method [9]. We applied these procedures first to raw HTS measurements and then to estimated hit distribution surfaces. The three competing tests were applied to analyse simulated datasets containing different types of systematic error, and to a real HTS dataset. Their accuracy was compared under various error conditions. Conclusions: A successful assessment of the presence of systematic error in experimental HTS assays is possible when the appropriate statistical methodology is used. Namely, the t-test should be carried out by researchers to determine whether systematic error is present in their HTS data prior to applying any error correction method. This important step can significantly improve the quality of selected hits.
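
The abstract above recommends a t-test to check for systematic error before any correction is applied, but it does not spell out the exact procedure. The following is only a minimal sketch of one plausible setup, assuming that each row of a plate is compared with all remaining wells using Welch's t statistic. The function names and the toy plate are hypothetical, and turning the statistic into a p-value (which needs the t distribution) is left out.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))


def row_t_statistics(plate):
    """Compare each row of a plate (list of rows) with all other wells."""
    stats = []
    for i, row in enumerate(plate):
        rest = [value for j, other in enumerate(plate) if j != i for value in other]
        stats.append(welch_t(row, rest))
    return stats


# Hypothetical 4 x 6 plate: the first row carries an additive offset of +5,
# mimicking a row-wise systematic error; the other rows are unaffected.
toy_plate = [
    [15, 16, 14, 15, 16, 14],
    [10, 11, 9, 10, 11, 9],
    [10, 11, 9, 10, 11, 9],
    [10, 11, 9, 10, 11, 9],
]

if __name__ == "__main__":
    for index, t in enumerate(row_t_statistics(toy_plate)):
        print(f"row {index}: t = {t:.2f}")
```

A row whose statistic is far from zero relative to the others is a candidate for row-wise systematic error; the same comparison can be repeated column by column.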

    Building alternative consensus trees and supertrees using k-means and Robinson and Foulds (RF) distance

    Each gene has its own evolutionary history which can substantially differ from the evolutionary histories of other genes. For example, some individual genes or operons can be affected by specific horizontal gene transfer and recombination events. Thus, the evolutionary history of each gene should be represented by its own phylogenetic tree which may display different evolutionary patterns from the species tree that accounts for the main patterns of vertical descent. The output of traditional consensus tree or supertree inference methods is a unique consensus tree or supertree. We describe a new efficient method for inferring multiple alternative consensus trees and supertrees to best represent the most important evolutionary patterns of a given set of gene phylogenies. We show how an adapted version of the popular k-means clustering algorithm, based on some interesting properties of the Robinson and Foulds distance, can be used to partition a given set of trees into one (for homogeneous data) or multiple (for heterogeneous data) cluster(s) of trees. Moreover, we adapt the popular Caliński-Harabasz, Silhouette, Ball and Hall, and Gap cluster validity indices to tree clustering with k-means. Special attention is given to the relevant but very challenging problem of inferring alternative supertrees. The use of the Euclidean property of the objective function of the method makes it faster than the existing tree clustering techniques, and thus perfectly suitable for analyzing large evolutionary datasets. We apply the new method to discover alternative supertrees characterizing the main patterns of evolution of SARS-CoV-2 and genetically related betacoronaviruses.
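
To make the Robinson and Foulds (RF) distance used in this abstract concrete, here is a minimal sketch that counts the non-trivial bipartitions present in one tree but not in the other. It assumes a toy representation in which trees are nested tuples of leaf names and are treated as unrooted. The function names are hypothetical, and the k-means adaptation described by the authors is not reproduced here.

```python
def leaf_set(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of a tree, read as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if reference not in below else all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Size of the symmetric difference of the two bipartition sets."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must be defined on the same set of leaves")
    return len(splits(tree_a) ^ splits(tree_b))


if __name__ == "__main__":
    t1 = (("A", "B"), ("C", "D"))
    t2 = (("A", "C"), ("B", "D"))
    print(rf_distance(t1, t1), rf_distance(t1, t2))
```

A pairwise matrix of such distances over a set of gene trees is the kind of input a tree clustering procedure would work from.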

    Inferring explicit weighted consensus networks to represent alternative evolutionary histories

    Background: The advent of molecular biology techniques and constant increase in availability of genetic material have triggered the development of many phylogenetic tree inference methods. However, several reticulate evolution processes, such as horizontal gene transfer and hybridization, have been shown to blur the species evolutionary history by causing discordance among phylogenies inferred from different genes. Methods: To tackle this problem, we hereby describe a new method for inferring and representing alternative (reticulate) evolutionary histories of species as an explicit weighted consensus network which can be constructed from a collection of gene trees with or without prior knowledge of the species phylogeny. Results: We provide a way of building a weighted phylogenetic network for each of the following reticulation mechanisms: diploid hybridization, intragenic recombination and complete or partial horizontal gene transfer. We successfully tested our method on some synthetic and real datasets to infer the above-mentioned evolutionary events which may have influenced the evolution of many species. Conclusions: Our weighted consensus network inference method allows one to infer, visualize and validate statistically major conflicting signals induced by the mechanisms of reticulate evolution. The results provided by the new method can be used to represent the inferred conflicting signals by means of explicit and easy-to-interpret phylogenetic networks.

    A new effective method for estimating missing values in the sequence data prior to phylogenetic analysis

    In this article we address the problem of phylogenetic inference from nucleic acid data containing missing bases. We introduce a new effective approach, called “Probabilistic estimation of missing values” (PEMV), allowing one to estimate unknown nucleotides prior to computing the evolutionary distances between sequences. We show that the new method improves the accuracy of phylogenetic inference compared to the existing methods “Ignoring Missing Sites” (IMS) and “Proportional Distribution of Missing and Ambiguous Bases” (PDMAB), the latter included in the PAUP software [26]. The proposed strategy for estimating missing nucleotides is based on probabilistic formulae developed in the framework of the Jukes-Cantor [10] and Kimura 2-parameter [11] models. The relative performance of the new method was assessed through simulations carried out with the Seq-Gen program [20], for data generation, and the BioNJ method [7], for inferring phylogenies. We also compared the new method to the DNAML program [5] and “Matrix Representation using Parsimony” (MRP) [13], [19], considering an example of 66 eutherian mammals originally analyzed in [17].
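
For orientation, the sketch below shows the baseline that the abstract calls “Ignoring Missing Sites” (IMS) combined with the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. It is not the PEMV method, whose formulae are not given in the abstract. The function names are hypothetical, and any character other than A, C, G or T is assumed to mark a missing or ambiguous base.

```python
import math

VALID_BASES = set("ACGT")


def p_distance_ims(seq1, seq2):
    """Proportion of differing sites, skipping sites with a missing base."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a in VALID_BASES and b in VALID_BASES
    ]
    if not pairs:
        raise ValueError("no site is observed in both sequences")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    s1 = "AC?TACGTACG"
    s2 = "ACGTACGTAAG"
    p = p_distance_ims(s1, s2)
    print(p, jukes_cantor(p))
```

Because IMS simply drops every site with a missing base, the estimate rests on fewer sites as the amount of missing data grows, which is the weakness the abstract's method is meant to address.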

    On k-means iterations and Gaussian clusters

    Nowadays, k-means remains arguably the most popular clustering algorithm (Jain, 2010; Vouros et al., 2021). Two of its main properties are simplicity and speed in practice. Here, our main claim is that the average number of iterations k-means takes to converge (τ̄) is in fact very informative. We find this to be particularly interesting because τ̄ is always known when applying k-means but has never been, to our knowledge, used in the data analysis process. By experimenting with Gaussian clusters, we show that τ̄ is related to the structure of a data set under study. Data sets containing Gaussian clusters have a much lower τ̄ than those containing uniformly random data. In fact, we go considerably further and demonstrate a pattern of inverse correlation between τ̄ and the clustering quality. We illustrate the importance of our findings through two practical applications. First, we describe the cases in which τ̄ can be effectively used to identify irrelevant features present in a given data set or be used to improve the results of existing feature selection algorithms. Second, we show that there is a strong relationship between τ̄ and the number of clusters in a data set, and that this relationship can be used to find the true number of clusters it contains.
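
The quantity discussed in this abstract, the average number of iterations k-means needs to converge, is easy to record. The sketch below is a plain Lloyd-style k-means written from scratch, not the authors' experimental code. The function names, the random initialization from data points, the convergence rule (assignments stop changing) and the toy data generators are all assumptions made for illustration.

```python
import random


def kmeans_iterations(points, k, rng, max_iter=300):
    """Run k-means once; return (labels, number of iterations performed)."""
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [-1] * len(points)
    for iteration in range(1, max_iter + 1):
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            return labels, iteration
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, max_iter


def mean_iterations(points, k, runs=20, seed=0):
    """Average iteration count over several random initializations."""
    rng = random.Random(seed)
    return sum(kmeans_iterations(points, k, rng)[1] for _ in range(runs)) / runs


def gaussian_clusters(rng, centers, n_per_cluster, sd=0.5):
    """Toy data: spherical Gaussian clusters around the given centers."""
    return [
        tuple(rng.gauss(c, sd) for c in center)
        for center in centers
        for _ in range(n_per_cluster)
    ]


def uniform_data(rng, n, dim, low, high):
    """Toy data: points drawn uniformly at random from a hypercube."""
    return [tuple(rng.uniform(low, high) for _ in range(dim)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    clustered = gaussian_clusters(rng, [(0, 0), (6, 0), (0, 6)], 50)
    unstructured = uniform_data(rng, 150, 2, -2, 8)
    print("Gaussian clusters:", mean_iterations(clustered, 3))
    print("Uniform data:", mean_iterations(unstructured, 3))
```

Running the script prints the two averages side by side, so the comparison reported in the abstract can be explored on small synthetic data.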

    Evolutionary history of bacteriophages with double-stranded DNA genomes

    Background: Reconstruction of evolutionary history of bacteriophages is a difficult problem because of fast sequence drift and lack of omnipresent genes in phage genomes. Moreover, losses and recombinational exchanges of genes are so pervasive in phages that the plausibility of phylogenetic inference in phage kingdom has been questioned. Results: We compiled the profiles of presence and absence of 803 orthologous genes in 158 completely sequenced phages with double-stranded DNA genomes and used these gene content vectors to infer the evolutionary history of phages. There were 18 well-supported clades, mostly corresponding to accepted genera, but in some cases appearing to define new taxonomic groups. Conflicts between this phylogeny and trees constructed from sequence alignments of phage proteins were exploited to infer 294 specific acts of intergenome gene transfer. Conclusion: A notoriously reticulate evolutionary history of fast-evolving phages can be reconstructed in considerable detail by quantitative comparative genomics. Open peer review: This article was reviewed by Eugene Koonin, Nicholas Galtier and Martijn Huynen.